Fix out-of-bounds access in module_diag_hailcast.F90 which crashes RRFS on WCOSS2 #2064
This is a draft because I discovered I used the wrong fix. I need to do this instead:

-  DO k=KBAS,nz
+  DO k=KBAS+1,nz
     RWA_new(k) = RWA_new(k) / (h1d(k)-h1d(k-1))
   ENDDO

I'm testing this now on WCOSS2 and Hera.
The correct fix works too. This is ready for review.
@SamuelTrahanNOAA do you think we can combine this PR with @dustinswales's #2035? Both have no baseline changes. For this PR, we just need to point to the cubed-sphere branch.
This PR needs more testing and review before it can be merged. The original developer of the code needs to confirm my fix is correct. Also, the parallels are still testing the change.
I've changed this to a draft to reflect the fact that it needs more review.
There are more errors in the module_diag_hailcast.F90 code than I first thought. I'm waiting to hear back from the original developer about how to fix them.
@jkbk2004 - This has been ready for review for several days. I forgot to mark it so.
@SamuelTrahanNOAA can you sync up the branch? There is WCOSS2 file system maintenance this week, so we may go ahead with the no-baseline-change PRs first.
@jkbk2004 - Yes, it is up to date now.
@zach1221 @FernandoAndrade-NOAA @BrianCurtis-NOAA FYI: this PR is ready.
If we're skipping WCOSS2 we can't properly test this PR, right?
@BrianCurtis-NOAA @SamuelTrahanNOAA I think we should move on to merge this PR even with only RDHPCS tests. The government shutdown risk has still not cleared. If there is a shutdown, GFDL feds will not be available.
There has been some minimal testing on WCOSS with RRFS. That isn't as thorough as a regression test. Is Acorn available?
@SamuelTrahanNOAA was this issue also present on Acorn? If the WCOSS2 issue shows up there too, I would be OK with using the Acorn RT to confirm it's fixed.
I'm more concerned about whether the results change for other tests. We know this doesn't stop all the crashes. That's why Anning has another fix. There's a bug fix in the DA code. Most likely, there'll be a dozen more bug fixes before this is operational.
I've retested it with one of the failing cases. The test was on Hera, but it fails in debug mode reliably on Hera without the fix. The latest fix still works.
For now, Acorn is a full pass. Not happy with these long-term WCOSS2 outages.
Acorn was created to lessen the impact of these outages, and also security concerns. WCOSS2's predecessors didn't have a mid-sized test system. You are witnessing the Acorn test machine serving its intended purpose.
OK! I think we can move on to the merging process. @SamuelTrahanNOAA I am moving to NOAA-GFDL/GFDL_atmos_cubed_sphere#308 to ask for a merge there.
@SamuelTrahanNOAA The FV3 PR was merged.
I've reverted the .gitmodules changes and updated the FV3 hash to NOAA-EMC develop's head. This PR is ready for final review and merge.
Commit Queue Requirements:
PR Information
Description
Fixes a bug that can crash the RRFS ensembles. When KBAS=1, there's an out-of-bounds write in an array. That corrupts memory and occasionally crashes the model.
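The loop quoted earlier in the conversation indexes h1d(k-1), so starting it at k=KBAS touches index 0 whenever KBAS=1. A minimal standalone sketch of the corrected bounds follows, assuming 1-based arrays of length nz; the program name, the value of nz, and the array contents are made up for illustration and are not taken from module_diag_hailcast.F90.

```fortran
program kbas_bounds_sketch
  implicit none
  integer, parameter :: nz = 5      ! hypothetical number of levels
  integer :: k, KBAS
  real :: h1d(nz), RWA_new(nz)      ! illustrative values, not model data

  h1d     = (/ 100.0, 250.0, 450.0, 700.0, 1000.0 /)
  RWA_new = 1.0
  KBAS    = 1                       ! the case described as triggering the bug

  ! Starting at k=KBAS would evaluate h1d(0) when KBAS=1, which is
  ! outside the declared bounds 1..nz. Starting at KBAS+1 keeps k-1 >= 1.
  do k = KBAS+1, nz
     RWA_new(k) = RWA_new(k) / (h1d(k) - h1d(k-1))
  end do

  print *, RWA_new
end program kbas_bounds_sketch
```

Changing the loop start back to KBAS and building with a bounds-checking option (for example gfortran's -fcheck=bounds) should flag the access at run time, which is consistent with the report that the failure reproduces reliably in debug mode.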
Also removes the unused DO_MYJPBL variable, whose value contains a typo, from default_vars.sh.
Commit Message
Priority
Blocking Dependencies
Git Issues Fixed By This PR
Changes
Subcomponent (with links)
Input data
Regression Tests:
Libraries
Testing Log: